To predict the fantasy scores of NBA players in the 2021-2022 season, I pulled data from Basketball Reference using nbastatR, grabbing the last 20 years of player season totals, in both basic and advanced form, which can conveniently be done with just one line of code!
df <- nbastatR::bref_players_stats(seasons = 2000:2021, tables = c("totals", "advanced"), widen = TRUE, assign_to_environment = F)
The only initial data manipulation is creating the response variable for the dataset: fantasy points. This is done with a very simple formula in my league; the coefficients are arbitrarily chosen and lead to some interesting weighting in our modelling later on.
coefs <- tibble::tibble(Points = 1,
                        Rebounds = 1.2,
                        Assists = 1.5,
                        Steals = 3,
                        Blocks = 3,
                        Turnovers = -1,
                        `3 Point FGM` = 1) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Multiplier") %>%
  gt::gt() %>%
  data_color(
    columns = c(Multiplier),
    colors = scales::col_numeric(
      palette = paletteer::paletteer_d(
        palette = "rcartocolor::Mint"
      ) %>% as.character(),
      domain = NULL
    )
  ) %>%
  DGThemes::gt_theme_duncan()

coefs %>%
  tab_header(title = "Multipliers for Points in my Fantasy League")
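As a concrete sketch, the response can be computed directly from the season-total columns by applying those multipliers; the column names here (ptsTotals, trbTotals, and so on) assume the widened nbastatR output used for modelling later.

```r
library(dplyr)

# Apply the league scoring formula to the season totals to create the
# response variable; column names assume the widened nbastatR output.
df <- df %>%
  mutate(fantasy_points = 1   * ptsTotals +
                          1.2 * trbTotals +
                          1.5 * astTotals +
                          3   * stlTotals +
                          3   * blkTotals +
                          -1  * tovTotals +
                          1   * fg3mTotals)
```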
The data we have is tabular and can easily be given an id, which I did by adding a column that pastes the season and player name together (the additional 01 notation ensures players with the same name don't get merged). The format is shown below with the top 10 fantasy seasons ever in my league.
| id | fantasy_points |
|---|---|
| westbru01_2017 | 5105.8 |
| hardeja01_2019 | 4957.6 |
| hardeja01_2017 | 4782.3 |
| onealsh01_2000 | 4688.1 |
| garneke01_2004 | 4660.3 |
| jamesle01_2018 | 4585.3 |
| jamesle01_2009 | 4501.1 |
| garneke01_2003 | 4464.9 |
| westbru01_2018 | 4439.8 |
| bryanko01_2003 | 4439.3 |
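A minimal sketch of building that id column, assuming slugPlayerBREF and yearSeason are the relevant column names in the widened nbastatR output (the player-name slug already carries the 01 suffix):

```r
library(dplyr)

# Paste the bref player slug and the season together to form a unique
# player-season id, e.g. "westbru01_2017".
df <- df %>%
  mutate(id = paste(slugPlayerBREF, yearSeason, sep = "_"))
```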
This data lends itself quite easily to an SQLite relational database, which I set up with RSQLite. First I save the data as a CSV in the data folder, then I establish a local connection and write the table to the database file. In this case I set overwrite = TRUE since I have written to the db before and am just saving it again, but it is generally unnecessary.
readr::write_csv(df, "Data/player_season_totals.csv")

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "nba.db")
upload_data <- readr::read_csv("Data/player_season_totals.csv")

RSQLite::dbWriteTable(conn = con,
                      name = "player_seasons",
                      value = upload_data,
                      overwrite = TRUE)
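Once the table is in the database, it can be queried back with ordinary SQL. For example, a sketch of pulling the top fantasy seasons straight from SQLite (assuming the id and fantasy_points columns shown earlier are in the table):

```r
# Query the ten best fantasy seasons directly from the database.
top10 <- DBI::dbGetQuery(con,
  "SELECT id, fantasy_points
   FROM player_seasons
   ORDER BY fantasy_points DESC
   LIMIT 10")

# Close the connection when done.
DBI::dbDisconnect(con)
```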
With the 2021-2022 season underway, I keep the table up to date by running the following script regularly on the server hosting the data.
new_df <- nbastatR::bref_players_stats(2022, tables = c("totals", "advanced"), widen = T, assign_to_environment = F)
new_df %>%
  janitor::clean_names() %>%
  readr::write_csv("Data/latest_season.csv")
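The CSV is what the prediction code reads later, but the same rows could also be appended to the SQLite table directly. A sketch, assuming the column names match those already in the table:

```r
# Append the in-progress season to the existing player_seasons table
# rather than overwriting it.
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "nba.db")

RSQLite::dbWriteTable(conn = con,
                      name = "player_seasons",
                      value = new_df,
                      append = TRUE)

DBI::dbDisconnect(con)
```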
I made this table for fun using the wonderful gtExtras package by Tom Mock and just want to include it in the post.
top5_fg <- new_df %>%
  group_by(slug_position) %>%
  arrange(desc(pct_fg)) %>%
  slice_head(n = 5) %>%
  dplyr::filter(slug_position %in% c("C", "SF", "PF", "PG", "SG")) %>%
  transmute(Position = slug_position,
            url_player_headshot,
            Player = name_player,
            `% FG` = pct_fg,
            `% EFG` = pct_efg,
            `% TS` = pct_true_shooting) %>%
  gt::gt() %>%
  fmt_percent(
    columns = c(`% FG`, `% EFG`, `% TS`)
  ) %>%
  gtExtras::gt_color_rows(c(`% FG`, `% EFG`, `% TS`), palette = "rcartocolor::Mint") %>%
  text_transform(
    locations = cells_body(columns = url_player_headshot),
    fn = function(x) {
      web_image(
        url = x,
        height = as.numeric(50)
      )
    }
  ) %>%
  cols_label(
    url_player_headshot = " "
  ) %>%
  tab_header("Top 5 Field Goal % in Each Position in the 2021-2022 Season") %>%
  DGThemes::gt_theme_duncan() %>%
  suppressWarnings()
top5_fg
Top 5 Field Goal % in Each Position in the 2021-2022 Season

| Position | Player | % FG | % EFG | % TS |
|---|---|---|---|---|
| C | Jarrett Allen | 94.10% | 94.10% | 93.80% |
| C | Daniel Gafford | 87.50% | 87.50% | 85.80% |
| C | Dewayne Dedmon | 83.30% | 91.70% | 94.50% |
| C | Gorgui Dieng | 80.00% | 90.00% | 90.00% |
| C | Richaun Holmes | 80.00% | 83.30% | 82.70% |
| PF | Nemanja Bjelica | 75.00% | 79.20% | 81.50% |
| PF | P.J. Tucker | 75.00% | 1.00% | 1.00% |
| PF | Domantas Sabonis | 71.90% | 81.30% | 81.80% |
| PF | Aaron Gordon | 66.70% | 70.80% | 72.70% |
| PF | Jaden McDaniels | 66.70% | 66.70% | 66.70% |
| PG | Jrue Holiday | 71.40% | 85.70% | 85.70% |
| PG | Patty Mills | 68.80% | 1.00% | 1.00% |
| PG | Trey Burke | 66.70% | 83.30% | 83.30% |
| PG | Tre Jones | 66.70% | 66.70% | 66.70% |
| PG | Ja Morant | 58.60% | 60.30% | 61.90% |
| SF | Thanasis Antetokounmpo | 80.00% | 80.00% | 83.30% |
| SF | Kenyon Martin Jr. | 75.00% | 75.00% | 71.70% |
| SF | Deni Avdija | 66.70% | 75.00% | 72.70% |
| SF | Dalano Banton | 66.70% | 75.00% | 75.00% |
| SF | Andre Iguodala | 66.70% | 77.80% | 81.00% |
| SG | Anfernee Simons | 83.30% | 91.70% | 91.70% |
| SG | Seth Curry | 76.50% | 94.10% | 92.30% |
| SG | John Konchar | 66.70% | 66.70% | 66.70% |
| SG | Eric Bledsoe | 62.50% | 65.60% | 63.50% |
| SG | Zach LaVine | 61.10% | 70.80% | 76.70% |
# top5_fg %>%
# gtsave(here::here("images/top5.png"))
Finally, some machine learning. First we filter down to the variables relevant for predicting fantasy points.
normal_df <- df %>%
  mutate(id = slugPlayerSeason) %>%
  transmute(count_games = countGames,
            count_games_started = countGamesStarted,
            pct_fg = pctFG,
            pct_fg3 = pctFG3,
            pct_fg2 = pctFG2,
            pct_efg = pctEFG,
            pct_ft = pctFT,
            minutes = minutesTotals,
            fgm_total = fgmTotals,
            fga_total = fgaTotals,
            fg3m_totals = fg3mTotals,
            fg3a_totals = fg3aTotals,
            fg2m_totals = fg2mTotals,
            fg2a_totals = fg2aTotals,
            ftm_totals = ftmTotals,
            fta_totals = ftaTotals,
            orb_totals = orbTotals,
            drb_totals = drbTotals,
            trb_totals = trbTotals,
            ast_totals = astTotals,
            stl_totals = stlTotals,
            blk_totals = blkTotals,
            tov_totals = tovTotals,
            pf_totals = pfTotals,
            pts_totals = ptsTotals,
            fantasy_points = fantasy_points)
That's a lot of statistics, so let's plot each one against fantasy points to get a feel for the relationships.
normal_df %>%
  pivot_longer(count_games:pts_totals, names_to = "stat", values_to = "value") %>%
  ggplot(aes(value, fantasy_points, color = stat)) +
  geom_point(alpha = 0.6) +
  ggtitle("Fantasy Points by Statistic each Season from the 1999-2000 to 2020-2021 NBA Season") +
  labs(y = "Fantasy Points", x = "") +
  paletteer::scale_color_paletteer_d(palette = "dichromat::SteppedSequential_5") +
  facet_wrap( ~ stat, ncol = 5, scales = "free_x") +
  theme(legend.position = "none")

ggsave(here::here("images/eda_points_stat.png"), width = 10, height = 10)
We start by loading the tidymodels metapackage, and splitting the data into training and testing sets.
# Later: add stratification by 3-point era?
library(tidymodels)
set.seed(2021)
nba_split <- initial_split(normal_df)
nba_train <- training(nba_split)
nba_test <- testing(nba_split)
An XGBoost model is based on trees, so we don't have to do any more preprocessing than has already been done on the data; we can go straight into model specification and hyperparameter tuning.
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(), min_n = tune(),
  loss_reduction = tune(),              ## first three: model complexity
  sample_size = tune(), mtry = tune(),  ## randomness
  learn_rate = tune()                   ## step size
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
xgb_spec
## Boosted Tree Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
## tree_depth = tune()
## learn_rate = tune()
## loss_reduction = tune()
## sample_size = tune()
##
## Computational engine: xgboost
That's a lot of hyperparameters, so let's create a space-filling design to cover the hyperparameter space as efficiently as possible.
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), nba_train),
  learn_rate(),
  size = 30
)
xgb_grid
## # A tibble: 30 × 6
## tree_depth min_n loss_reduction sample_size mtry learn_rate
## <int> <int> <dbl> <dbl> <int> <dbl>
## 1 15 18 3.42e- 8 0.548 9 1.03e- 4
## 2 3 4 6.00e- 4 0.958 7 8.72e- 8
## 3 2 31 1.87e- 8 0.344 6 4.87e-10
## 4 11 27 2.00e- 6 0.857 4 8.89e- 4
## 5 2 33 3.96e- 3 0.924 2 2.24e- 6
## 6 6 15 3.71e- 5 0.804 8 2.13e- 7
## 7 9 5 4.59e- 7 0.377 24 1.39e- 8
## 8 12 6 3.54e- 2 0.270 14 7.59e- 2
## 9 12 12 8.89e-10 0.763 18 1.92e-10
## 10 4 9 1.79e- 1 0.121 18 3.51e- 3
## # … with 20 more rows
Note that mtry() had to be treated differently because it depends on the number of predictors in the data.
Next we put the model specification into a workflow for convenience. Since there's no preprocessing necessary here, we can go straight to add_formula() as our data preprocessor.
xgb_wf <- workflow() %>%
  add_formula(fantasy_points ~ .) %>%
  add_model(xgb_spec)
xgb_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: boost_tree()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## fantasy_points ~ .
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
## tree_depth = tune()
## learn_rate = tune()
## loss_reduction = tune()
## sample_size = tune()
##
## Computational engine: xgboost
Next we create the cross-validation resamples for tuning the model.
set.seed(2021)
nba_folds <- vfold_cv(nba_train)
nba_folds
## # 10-fold cross-validation
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [7007/779]> Fold01
## 2 <split [7007/779]> Fold02
## 3 <split [7007/779]> Fold03
## 4 <split [7007/779]> Fold04
## 5 <split [7007/779]> Fold05
## 6 <split [7007/779]> Fold06
## 7 <split [7008/778]> Fold07
## 8 <split [7008/778]> Fold08
## 9 <split [7008/778]> Fold09
## 10 <split [7008/778]> Fold10
Now we get ready for the big task: tuning the grid with tune_grid() using the workflow, resamples, and grid of parameters to try. We use control_grid(save_pred = TRUE) so we can explore the predictions afterwards.
doParallel::registerDoParallel()
set.seed(2021)
xgb_res <- tune_grid(
  xgb_wf,
  resamples = nba_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)
xgb_res
## # Tuning results
## # 10-fold cross-validation
## # A tibble: 10 × 5
## splits id .metrics .notes .predictions
## <list> <chr> <list> <list> <list>
## 1 <split [7007/779]> Fold01 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,37…
## 2 <split [7007/779]> Fold02 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,37…
## 3 <split [7007/779]> Fold03 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,37…
## 4 <split [7007/779]> Fold04 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,37…
## 5 <split [7007/779]> Fold05 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,37…
## 6 <split [7007/779]> Fold06 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,37…
## 7 <split [7008/778]> Fold07 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,34…
## 8 <split [7008/778]> Fold08 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,34…
## 9 <split [7008/778]> Fold09 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,34…
## 10 <split [7008/778]> Fold10 <tibble [60 × 10]> <tibble [0 × 1]> <tibble [23,34…
It’s easy to explore all metrics for the models with collect_metrics()!
collect_metrics(xgb_res)
## # A tibble: 60 × 12
## mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
## <int> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 9 18 15 1.03e- 4 0.0000000342 0.548 rmse
## 2 9 18 15 1.03e- 4 0.0000000342 0.548 rsq
## 3 7 4 3 8.72e- 8 0.000600 0.958 rmse
## 4 7 4 3 8.72e- 8 0.000600 0.958 rsq
## 5 6 31 2 4.87e-10 0.0000000187 0.344 rmse
## 6 6 31 2 4.87e-10 0.0000000187 0.344 rsq
## 7 4 27 11 8.89e- 4 0.00000200 0.857 rmse
## 8 4 27 11 8.89e- 4 0.00000200 0.857 rsq
## 9 2 33 2 2.24e- 6 0.00396 0.924 rmse
## 10 2 33 2 2.24e- 6 0.00396 0.924 rsq
## # … with 50 more rows, and 5 more variables: .estimator <chr>, mean <dbl>,
## # n <int>, std_err <dbl>, .config <chr>
We can also visualize these metrics to better understand the results.
xgb_res %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  select(mean, mtry:sample_size) %>%
  pivot_longer(mtry:sample_size,
               values_to = "value",
               names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "R Squared")
And let tidymodels show the best performing sets of parameters with show_best(); it's convenient that the best RMSE model is also the best R-squared one.
show_best(xgb_res, "rmse"); show_best(xgb_res, "rsq")
## # A tibble: 5 × 12
## mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
## <int> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 23 10 4 0.0218 0.0142 0.322 rmse
## 2 14 6 12 0.0759 0.0354 0.270 rmse
## 3 11 35 13 0.0488 12.8 0.244 rmse
## 4 21 20 5 0.00920 0.000000213 0.997 rmse
## 5 18 9 4 0.00351 0.179 0.121 rmse
## # … with 5 more variables: .estimator <chr>, mean <dbl>, n <int>,
## # std_err <dbl>, .config <chr>
## # A tibble: 5 × 12
## mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
## <int> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 23 10 4 0.0218 1.42e- 2 0.322 rsq
## 2 14 6 12 0.0759 3.54e- 2 0.270 rsq
## 3 11 35 13 0.0488 1.28e+ 1 0.244 rsq
## 4 21 20 5 0.00920 2.13e- 7 0.997 rsq
## 5 19 17 14 0.00303 1.97e-10 0.189 rsq
## # … with 5 more variables: .estimator <chr>, mean <dbl>, n <int>,
## # std_err <dbl>, .config <chr>
Or select the best performing model overall:
best_rsq <- select_best(xgb_res, "rsq")
best_rsq
## # A tibble: 1 × 7
## mtry min_n tree_depth learn_rate loss_reduction sample_size .config
## <int> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 23 10 4 0.0218 0.0142 0.322 Preprocessor1_Mo…
Now we finalize the tunable workflow with the parameter values we have chosen.
final_xgb <- finalize_workflow(
  xgb_wf,
  best_rsq
)
final_xgb
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: boost_tree()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## fantasy_points ~ .
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (regression)
##
## Main Arguments:
## mtry = 23
## trees = 1000
## min_n = 10
## tree_depth = 4
## learn_rate = 0.0217749924129286
## loss_reduction = 0.0141520961395455
## sample_size = 0.321661404399201
##
## Computational engine: xgboost
Instead of tune() placeholders, the model's hyperparameters now have real values.
Which predictors matter most to the model? Let's check variable importance.
library(vip)
final_xgb %>%
  fit(data = nba_train) %>%
  extract_fit_parsnip() %>%  # pull_workflow_fit() is deprecated
  vip(geom = "point")
Looks like minutes played has the most importance, followed by points and then field goals made.
We can save the fitted booster with xgboost::xgb.save() and reload it later with xgboost::xgb.load().
final_xgb %>%
  fit(data = nba_train) %>%
  extract_fit_engine() %>%
  xgboost::xgb.save(., here::here("models/normal_model"))
## [1] TRUE
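Reloading is symmetric. A sketch; note that xgb.load() returns a raw xgb.Booster rather than a tidymodels workflow, so new data would need to be supplied as a numeric matrix with the predictors in the training column order:

```r
# Reload the saved booster from disk for later prediction.
bst <- xgboost::xgb.load(here::here("models/normal_model"))
```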
Finally, let's use last_fit() to fit the model one final time on the training data and evaluate it on the testing set. Notice that this is the first time we have used the testing data in this whole modeling analysis.
final_res <- last_fit(final_xgb, nba_split)
collect_metrics(final_res)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 34.5 Preprocessor1_Model1
## 2 rsq standard 0.999 Preprocessor1_Model1
Our results might suggest overfitting, but remember that fantasy points are computed as a direct linear function of the season totals in the data, so a near-perfect fit is expected. Since this is regression rather than classification, an ROC curve doesn't apply; instead we plot the predicted values against the true values for the testing set.
final_res %>%
  collect_predictions() %>%
  ggplot(aes(fantasy_points, .pred, color = id)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_point(alpha = 0.2) +
  labs(x = "Truth", y = "Predicted Value",
       color = NULL,
       title = "Predicted and True Values for Fantasy Points")
But let's go ahead and look at some of the predictions for this year:
new_df <- readr::read_csv(here::here("Data/latest_season.csv")) %>%
  dplyr::select(!where(is.factor)) %>%
  rename(fgm_total = fgm_totals, fga_total = fga_totals)

fitted_wf <- pluck(final_res$.workflow, 1)

preds <- predict(fitted_wf, new_data = new_df) %>%
  as_vector()

pred_df <- new_df %>%
  bind_cols(predictions = preds)
Here are the current projections based on totals. It looks pretty reasonable at the top, but lower down I'm seeing some difficulties, and I think the model's reliance on time played is really impacting its performance.
pred_df %>%
  arrange(desc(predictions)) %>%
  dplyr::slice(1:100) %>%
  ggplot(aes(fct_reorder(name_player, predictions), predictions)) +
  geom_col(color = "#3EB489", alpha = 0.8, fill = "#3EB489") +
  coord_flip() +
  labs(x = "Player", y = "Fantasy Points Prediction",
       title = "Top 100 Predictions Based on Season Totals") +
  theme(legend.position = "none",
        axis.text.y = element_text(size = 7))